DS1000: by examples


Not solved by any model

There are 159 examples that no model solves. If these problems are sound, solving some of them is a good signal that your model is indeed better than the leading models.
DS/105, DS/106, DS/107, DS/108, DS/121, DS/122, DS/131, DS/132, DS/142, DS/15, DS/159, DS/172, DS/173, DS/174, DS/178, DS/197, DS/202, DS/203, DS/204, DS/205, DS/208, DS/209, DS/210, DS/211, DS/216, DS/225, DS/228, DS/236, DS/242, DS/244, DS/245, DS/250, DS/26, DS/263, DS/269, DS/270, DS/272, DS/280, DS/284, DS/285, DS/286, DS/29, DS/318, DS/319, DS/328, DS/338, DS/339, DS/345, DS/354, DS/362, DS/372, DS/373, DS/374, DS/375, DS/385, DS/387, DS/389, DS/390, DS/394, DS/40, DS/407, DS/408, DS/410, DS/411, DS/416, DS/418, DS/42, DS/420, DS/421, DS/427, DS/43, DS/439, DS/44, DS/440, DS/45, DS/458, DS/46, DS/463, DS/468, DS/48, DS/488, DS/509, DS/515, DS/516, DS/521, DS/526, DS/54, DS/56, DS/567, DS/57, DS/571, DS/58, DS/585, DS/59, DS/596, DS/6, DS/60, DS/612, DS/621, DS/622, DS/626, DS/635, DS/638, DS/65, DS/67, DS/672, DS/694, DS/699, DS/7, DS/701, DS/726, DS/73, DS/74, DS/744, DS/747, DS/75, DS/755, DS/763, DS/772, DS/773, DS/775, DS/776, DS/779, DS/780, DS/781, DS/789, DS/79, DS/790, DS/798, DS/799, DS/8, DS/80, DS/808, DS/809, DS/81, DS/815, DS/86, DS/877, DS/879, DS/88, DS/883, DS/884, DS/885, DS/9, DS/90, DS/900, DS/901, DS/904, DS/905, DS/922, DS/926, DS/927, DS/953, DS/96, DS/984, DS/987, DS/993, DS/997, DS/998
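A list like the one above can be derived from per-model results with a set difference. The following is a minimal sketch, not the code behind this page; the model names, example IDs, and the `solved_by_model` layout are all made up for illustration.

```python
# Hypothetical input: model name -> set of example IDs that the model solved.
solved_by_model = {
    "model_a": {"DS/1", "DS/2"},
    "model_b": {"DS/2", "DS/3"},
}
# Hypothetical full set of example IDs in the benchmark.
all_examples = {"DS/1", "DS/2", "DS/3", "DS/4", "DS/5"}


def unsolved_examples(all_examples, solved_by_model):
    """Return the sorted IDs of examples that no model solved."""
    # Union of everything solved by at least one model.
    solved_by_any = set().union(*solved_by_model.values())
    # Plain string sort, so "DS/15" would come before "DS/2".
    return sorted(all_examples - solved_by_any)


unsolved = unsolved_examples(all_examples, solved_by_model)
print(f"{len(unsolved)} examples not solved by any model: {', '.join(unsolved)}")
```

The plain string sort matches the ordering of the list above, where DS/15 appears between DS/142 and DS/159.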

Problems solved by only one model

example_link model min_elo
DS/677 gpt-4-turbo-2024-04-09 1197.332
DS/505 gpt-4-turbo-2024-04-09 1197.332
DS/743 gpt-4-turbo-2024-04-09 1197.332
DS/304 gpt-4-turbo-2024-04-09 1197.332
DS/348 gpt-4-turbo-2024-04-09 1197.332
DS/946 gpt-4-turbo-2024-04-09 1197.332
DS/357 gpt-4-turbo-2024-04-09 1197.332
DS/253 gpt-4-turbo-2024-04-09 1197.332
DS/104 gpt-4-turbo-2024-04-09 1197.332
DS/129 gpt-4-turbo-2024-04-09 1197.332
DS/282 gpt-4-turbo-2024-04-09 1197.332
DS/222 gpt-4-turbo-2024-04-09 1197.332
DS/134 gpt-4-0613 1149.486
DS/765 gpt-4-0613 1149.486
DS/903 gpt-4-0613 1149.486
DS/130 gpt-4-0613 1149.486
DS/386 gpt-4-0613 1149.486
DS/39 gpt-4-0613 1149.486
DS/422 gpt-4-0613 1149.486
DS/807 gpt-4-0613 1149.486
DS/902 gpt-4-0613 1149.486
DS/154 gpt-4-0613 1149.486
DS/774 gpt-4-0613 1149.486
DS/784 gpt-4-0613 1149.486
DS/751 deepseek-ai-deepseek-coder-6.7b-instruct 1097.694
DS/243 deepseek-ai-deepseek-coder-6.7b-instruct 1097.694
DS/66 deepseek-ai-deepseek-coder-6.7b-instruct 1097.694
DS/749 deepseek-ai-deepseek-coder-6.7b-instruct 1097.694
DS/750 deepseek-ai-deepseek-coder-6.7b-instruct 1097.694
DS/995 microsoft-wavecoder-ultra-6.7b 1093.104
DS/679 microsoft-wavecoder-ultra-6.7b 1093.104
DS/681 microsoft-wavecoder-ultra-6.7b 1093.104
DS/201 m-a-p-OpenCodeInterpreter-DS-6.7B 1045.508
DS/806 m-a-p-OpenCodeInterpreter-DS-6.7B 1045.508
DS/671 m-a-p-OpenCodeInterpreter-DS-6.7B 1045.508
DS/55 codex002 1016.928
DS/87 gpt-3.5-turbo-0125 1003.268
DS/582 gpt-3.5-turbo-0125 1003.268
DS/447 m-a-p-OpenCodeInterpreter-CL-7B 1002.059
DS/51 m-a-p-OpenCodeInterpreter-CL-7B 1002.059
DS/153 gpt-3.5-turbo-0613 1000.000
DS/764 gpt-3.5-turbo-0613 1000.000
DS/813 gpt-3.5-turbo-0613 1000.000
DS/200 m-a-p-OpenCodeInterpreter-SC2-3B 988.615
DS/199 m-a-p-OpenCodeInterpreter-SC2-3B 988.615
DS/93 m-a-p-OpenCodeInterpreter-SC2-3B 988.615
DS/812 m-a-p-OpenCodeInterpreter-SC2-7B 973.125
DS/604 m-a-p-OpenCodeInterpreter-SC2-7B 973.125
DS/388 m-a-p-OpenCodeInterpreter-SC2-7B 973.125
DS/766 m-a-p-OpenCodeInterpreter-SC2-7B 973.125
DS/188 ibm-granite-granite-8b-code-base 945.349
DS/157 ibm-granite-granite-8b-code-base 945.349
DS/227 meta-llama-Meta-Llama-3-8B 913.457
DS/899 meta-llama-Meta-Llama-3-8B 913.457
DS/305 deepseek-ai-deepseek-coder-6.7b-base 905.448
DS/654 meta-llama-CodeLlama-13b-Python-hf 901.902
DS/224 microsoft-wavecoder-pro-6.7b 901.057
DS/474 microsoft-wavecoder-pro-6.7b 901.057
DS/996 microsoft-wavecoder-pro-6.7b 901.057
DS/240 ibm-granite-granite-8b-code-instruct 901.047
DS/241 ibm-granite-granite-8b-code-instruct 901.047
DS/462 google-codegemma-1.1-7b-it 882.609
DS/346 meta-llama-Meta-Llama-3-8B-Instruct 876.266
DS/164 meta-llama-CodeLlama-7b-Python-hf 827.560
DS/27 claude-3-sonnet-20240229 789.815
DS/95 claude-3-sonnet-20240229 789.815
DS/867 meta-llama-CodeLlama-7b-hf 774.028
DS/161 meta-llama-CodeLlama-7b-hf 774.028
DS/281 gpt-4o-2024-05-13 751.086
DS/264 gpt-4o-2024-05-13 751.086
DS/887 microsoft-phi-2 750.621
DS/533 Qwen-CodeQwen1.5-7B-Chat 737.466
DS/165 mistralai-Mistral-7B-Instruct-v0.2 730.941
DS/64 Salesforce-codegen25-7b-instruct_P 705.105
DS/886 google-codegemma-1.1-2b 661.894
DS/424 google-codegemma-2b 594.285
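The table above could be rebuilt from per-model results roughly as follows. This sketch assumes that `min_elo` is the Elo rating of the only model that solves the problem; the model names, ratings, and data layout are hypothetical.

```python
from collections import defaultdict

# Hypothetical Elo ratings and per-model solved sets.
elo = {"strong_model": 1200.0, "weak_model": 900.0}
solved_by_model = {
    "strong_model": {"DS/1", "DS/2"},
    "weak_model": {"DS/2", "DS/3"},
}


def single_solver_rows(solved_by_model, elo):
    """Return (example, model, min_elo) rows for examples solved by exactly one model."""
    solvers = defaultdict(list)
    for model, solved in solved_by_model.items():
        for example in solved:
            solvers[example].append(model)
    rows = [
        (example, models[0], elo[models[0]])
        for example, models in solvers.items()
        if len(models) == 1
    ]
    # Highest Elo first, as in the table above; ties are broken by example ID.
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows


rows = single_solver_rows(solved_by_model, elo)
for example, model, min_elo in rows:
    print(f"{example} {model} {min_elo:.3f}")
```

Rows near the bottom of such a table are the interesting ones: a problem whose only solver has a low rating is one that every stronger model misses.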

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (i.e., better models tend to do worse on them).

example_link acc tau
DS/880 0.302 -0.310
DS/392 0.429 -0.225
DS/882 0.111 -0.189
DS/470 0.032 -0.164
DS/611 0.286 -0.132
DS/424 0.016 -0.127
DS/523 0.032 -0.123
DS/886 0.016 -0.109
DS/64 0.016 -0.095
DS/881 0.270 -0.092
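The page does not define the two columns, so the sketch below rests on assumptions: that `acc` is the fraction of models that solve the problem, and that `tau` is Kendall's tau-b between each model's pass/fail result on the problem and that model's overall score. The scores and results are invented, and the pure-Python tau-b is for illustration only.

```python
import math


def kendall_tau_b(xs, ys):
    """Kendall's tau-b for two equal-length sequences, with tie correction."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: counted in neither term
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else float("nan")


def problem_stats(passed, overall_score):
    """Return (acc, tau) for one problem; both lists are aligned by model."""
    acc = sum(passed) / len(passed)
    return acc, kendall_tau_b(passed, overall_score)


# Hypothetical suspect problem: only the weakest of four models passes it.
passed = [1, 0, 0, 0]
overall_score = [0.2, 0.5, 0.6, 0.9]
acc, tau = problem_stats(passed, overall_score)
print(f"acc={acc:.3f} tau={tau:.3f}")
```

A negative tau, as in this toy case, is what flags a problem as suspect: passing it goes together with a lower overall score.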

Histogram of accuracies

Histogram of problems by the accuracy on each problem.

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.
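One way to compute the data behind such a histogram is sketched below. It assumes that the minimum Elo for a problem is the lowest rating among the models that solve it, and that unsolved problems are left out. The ratings, model names, and bin width are hypothetical.

```python
import math
from collections import Counter

# Hypothetical Elo ratings and per-model solved sets.
elo = {"strong_model": 1200.0, "weak_model": 900.0}
solved_by_model = {
    "strong_model": {"DS/1", "DS/2"},
    "weak_model": {"DS/2", "DS/3"},
}


def min_elo_to_solve(solved_by_model, elo):
    """Map each solved example to the lowest Elo among the models that solve it."""
    best = {}
    for model, solved in solved_by_model.items():
        for example in solved:
            best[example] = min(best.get(example, math.inf), elo[model])
    return best


def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of a bin of width bin_width."""
    counts = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


min_elo = min_elo_to_solve(solved_by_model, elo)
bins = histogram(min_elo.values(), bin_width=100)
for lower_edge, count in bins.items():
    print(f"[{lower_edge}, {lower_edge + 100}): {count}")
```

The same `histogram` helper works for per-problem accuracies with a smaller bin width such as 0.1.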